1 Description

The data set wine-quality-white-and-red.csv contains information from physicochemical and sensory tests performed on the white and red variants of the Portuguese “Vinho Verde” wine. It can be found here.

It contains the following variables:

Categorical or sensory values:

  • type
  • quality

Numerical or physicochemical tests:

  • fixed.acidity
  • volatile.acidity
  • citric.acid
  • residual.sugar
  • chlorides
  • free.sulfur.dioxide
  • total.sulfur.dioxide
  • density
  • pH
  • sulphates
  • alcohol

2 Exploratory Analysis

The data set consists of 6497 observations of 13 different variables:

##     type      fixed.acidity    volatile.acidity  citric.acid    
##  red  :1599   Min.   : 3.800   Min.   :0.0800   Min.   :0.0000  
##  white:4898   1st Qu.: 6.400   1st Qu.:0.2300   1st Qu.:0.2500  
##               Median : 7.000   Median :0.2900   Median :0.3100  
##               Mean   : 7.215   Mean   :0.3397   Mean   :0.3186  
##               3rd Qu.: 7.700   3rd Qu.:0.4000   3rd Qu.:0.3900  
##               Max.   :15.900   Max.   :1.5800   Max.   :1.6600  
##  residual.sugar     chlorides       free.sulfur.dioxide total.sulfur.dioxide
##  Min.   : 0.600   Min.   :0.00900   Min.   :  1.00      Min.   :  6.0       
##  1st Qu.: 1.800   1st Qu.:0.03800   1st Qu.: 17.00      1st Qu.: 77.0       
##  Median : 3.000   Median :0.04700   Median : 29.00      Median :118.0       
##  Mean   : 5.443   Mean   :0.05603   Mean   : 30.53      Mean   :115.7       
##  3rd Qu.: 8.100   3rd Qu.:0.06500   3rd Qu.: 41.00      3rd Qu.:156.0       
##  Max.   :65.800   Max.   :0.61100   Max.   :289.00      Max.   :440.0       
##     density             pH          sulphates         alcohol     
##  Min.   :0.9871   Min.   :2.720   Min.   :0.2200   Min.   : 8.00  
##  1st Qu.:0.9923   1st Qu.:3.110   1st Qu.:0.4300   1st Qu.: 9.50  
##  Median :0.9949   Median :3.210   Median :0.5100   Median :10.30  
##  Mean   :0.9947   Mean   :3.219   Mean   :0.5313   Mean   :10.49  
##  3rd Qu.:0.9970   3rd Qu.:3.320   3rd Qu.:0.6000   3rd Qu.:11.30  
##  Max.   :1.0390   Max.   :4.010   Max.   :2.0000   Max.   :14.90  
##     quality     
##  Min.   :3.000  
##  1st Qu.:5.000  
##  Median :6.000  
##  Mean   :5.818  
##  3rd Qu.:6.000  
##  Max.   :9.000

The data set consists mostly of white wine: 4898 of the 6497 observations (about 75%) are white wine and the remaining 1599 are red.

Figure 2.1: Wine type distribution

When we look at the density of each continuous variable, we see that all of them are right-skewed: most observations are concentrated at the lower values of the variable, with a long tail towards high values. This might also be a sign of outliers with high values. Another interesting observation is that most of the densities look bimodal, which might be due to a different mode for each of the 2 types of wine.

Figure 2.2: Numerical variable densities
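As a numeric complement to reading skewness off a density plot, the moment-based sample skewness can be computed directly. The following is only a minimal sketch: the function name `sample_skewness` and the toy sample are hypothetical and are not taken from the wine data.

```python
def sample_skewness(values):
    """Moment-based sample skewness g1 = m3 / m2**1.5."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n  # second central moment
    m3 = sum((v - mean) ** 3 for v in values) / n  # third central moment
    return m3 / m2 ** 1.5


# Hypothetical toy sample: most values are low and one is very high.
toy_sample = [1.0, 1.2, 1.3, 1.5, 1.6, 2.0, 2.2, 9.0]
print(sample_skewness(toy_sample) > 0)
```

A positive value indicates a long right tail, which is the pattern described for the wine variables; a symmetric sample gives a value of 0.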

The following plot shows the same densities as above, split by type of wine. We can group the variables by their behavior for each type of wine as follows:

  • High concentration in the left tail for white wines, even distribution for red wine: fixed.acidity, volatile.acidity, citric.acid, density.
  • High concentration in the left tail for red wines, even distribution for white wine: residual.sugar, free.sulfur.dioxide, total.sulfur.dioxide.
  • Same (or similar) density for both types of wine: chlorides, density, pH, sulphates, alcohol.

Figure 2.3: Numerical variable densities by type of wine

Now, let’s see if there is any relationship between the possible pairs of variables.

  • All Wine: The first scatter plot matrix represents the relationship between variables without separating by type of wine. In it we cannot see any obvious relationship between any pair of variables.

  • White Wine: In the second plot we only show the relationships between the variables for white wine. It closely resembles the previous plot, which is to be expected given that about 75% of our data belongs to this group.

  • Red Wine: Finally, we examine the relationships between the variables for red wine only. Although it also resembles the two previous plots, there seems to be a close to linear relationship between density and fixed.acidity, and between pH and fixed.acidity. Maybe not completely linear, but there is a visible correlation within both pairs of variables.

Now let’s see if there is any difference in overall behavior between the two types of wine. For this, we’ll use the Parallel Coordinates Plot (PCP) and the Andrews plot.

The PCP can be useful for finding highly correlated variables and distinct group behaviors. In our case we have so many observations that identifying correlated variables by eye is almost impossible. On the other hand, we can see that the behavior of both groups is similar in most of the variables, but not in all of them. White wine has a wider range of values for free.sulfur.dioxide, total.sulfur.dioxide and residual.sugar. Red wine has a small group of observations (maybe outliers?) with high values of chlorides, and it also seems to reach higher values than white wine for sulphates and volatile.acidity. Even so, there is no clear cut between the two types of wine.

That conclusion is reinforced by the Andrews plot in the lower part of our figure. In it we graph the finite Fourier series defined by each observation. In this case, the red wine curves stay within the range that could be expected for the white wine curves.
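To make the "finite Fourier series defined by each observation" concrete, here is a minimal sketch of the usual Andrews curve, f(t) = x1/√2 + x2 sin(t) + x3 cos(t) + x4 sin(2t) + x5 cos(2t) + ..., evaluated for one observation. The function name `andrews_curve` and the example vector are hypothetical; the plotting package used for the report may order or scale the terms differently.

```python
import math


def andrews_curve(x, t):
    """Evaluate the Andrews curve of observation x at angle t (radians)."""
    total = x[0] / math.sqrt(2)
    for i, value in enumerate(x[1:], start=1):
        harmonic = (i + 1) // 2  # 1, 1, 2, 2, 3, 3, ...
        if i % 2 == 1:
            total += value * math.sin(harmonic * t)  # odd positions use sine
        else:
            total += value * math.cos(harmonic * t)  # even positions use cosine
    return total


# Hypothetical standardized observation with three variables.
observation = [1.0, 2.0, 3.0]
curve = [andrews_curve(observation, -math.pi + k * math.pi / 4) for k in range(9)]
print(len(curve))
```

Each observation becomes one curve over t in [-π, π], so observations with similar values produce curves that stay close together, which is what the comparison between red and white wine relies on.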

3 Characteristics

3.1 Mean Vector

Below we have the mean vector (the mean value of each variable) for the whole data set and for each type of wine. It shows some interesting facts. For example, the mean residual.sugar differs considerably between the two wines: white wine has about 2.5 times more sugar on average than red wine. The same holds for both sulfur dioxide variables, free.sulfur.dioxide and total.sulfur.dioxide, where the average white wine has roughly 2.2 and 3 times the values of the average red wine. On the contrary, red wine has considerably higher values of volatile.acidity (~x1.9) and chlorides (~x1.9).

As for the rest of the variables, there is no considerable difference between the two subgroups.

Table 3.1: Mean vector for all wine and by each subgroup
Variable               All        Red        White
fixed.acidity          7.2153     8.3196     6.8548
volatile.acidity       0.3397     0.5278     0.2782
citric.acid            0.3186     0.2710     0.3342
residual.sugar         5.4432     2.5388     6.3914
chlorides              0.0560     0.0875     0.0458
free.sulfur.dioxide    30.5253    15.8749    35.3081
total.sulfur.dioxide   115.7446   46.4678    138.3607
density                0.9947     0.9967     0.9940
pH                     3.2185     3.3111     3.1883
sulphates              0.5313     0.6581     0.4898
alcohol                10.4918    10.4230    10.5143
quality                5.8184     5.6360     5.8779
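A quick consistency check on the mean vector: the overall mean of a variable must equal the average of the two group means weighted by group size (1599 red and 4898 white observations, from the summary above). The sketch below uses the fixed.acidity and residual.sugar values from Table 3.1; the helper name `pooled_mean` is hypothetical. The check only works out when the 8.3196 and 2.5388 values are attributed to the 1599 red wines.

```python
N_RED, N_WHITE = 1599, 4898  # group sizes reported in the summary output


def pooled_mean(mean_red, mean_white):
    """Overall mean as the size-weighted average of the two group means."""
    return (N_RED * mean_red + N_WHITE * mean_white) / (N_RED + N_WHITE)


# Group means taken from Table 3.1.
fixed_acidity_all = pooled_mean(8.3196, 6.8548)
residual_sugar_all = pooled_mean(2.5388, 6.3914)

print(round(fixed_acidity_all, 4), round(residual_sugar_all, 4))
```

Both results agree with the "All" column (7.2153 and 5.4432) up to rounding, whereas swapping the two group means does not.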

3.2 Covariance Analysis

Below are the covariance matrices for the whole data set, for red wine and for white wine, in that order. Most of the entries are close to or exactly 0, which suggests that most pairs of variables have little linear relationship, although the size of a covariance also depends on the scale of the variables involved. There are some noticeable exceptions:

  • High Positive Covariance: These pairs of variables move together, meaning that an increase in one is associated with an increase in the other.
    • All Wine: residual.sugar & total.sulfur.dioxide, free.sulfur.dioxide & total.sulfur.dioxide.
    • White Wine: residual.sugar & total.sulfur.dioxide, free.sulfur.dioxide & total.sulfur.dioxide.
    • Red Wine: free.sulfur.dioxide & total.sulfur.dioxide.
  • High Negative Covariance: Pairs of variables with an inverse relationship, meaning that an increase in one of them is associated with a decrease in the other.
    • All Wine: fixed.acidity & total.sulfur.dioxide, alcohol & total.sulfur.dioxide.
##                      fixed.acidity volatile.acidity citric.acid residual.sugar
## fixed.acidity                 1.68             0.05        0.06          -0.69
## volatile.acidity              0.05             0.03       -0.01          -0.15
## citric.acid                   0.06            -0.01        0.02           0.10
## residual.sugar               -0.69            -0.15        0.10          22.64
## chlorides                     0.01             0.00        0.00          -0.02
## free.sulfur.dioxide          -6.51            -1.03        0.34          34.02
## total.sulfur.dioxide        -24.11            -3.86        1.60         133.24
## density                       0.00             0.00        0.00           0.01
## pH                           -0.05             0.01       -0.01          -0.20
## sulphates                     0.06             0.01        0.00          -0.13
## alcohol                      -0.15            -0.01        0.00          -2.04
## quality                      -0.09            -0.04        0.01          -0.15
##                      chlorides free.sulfur.dioxide total.sulfur.dioxide density
## fixed.acidity             0.01               -6.51               -24.11    0.00
## volatile.acidity          0.00               -1.03                -3.86    0.00
## citric.acid               0.00                0.34                 1.60    0.00
## residual.sugar           -0.02               34.02               133.24    0.01
## chlorides                 0.00               -0.12                -0.55    0.00
## free.sulfur.dioxide      -0.12              315.04               723.26    0.00
## total.sulfur.dioxide     -0.55              723.26              3194.72    0.01
## density                   0.00                0.00                 0.01    0.00
## pH                        0.00               -0.42                -2.17    0.00
## sulphates                 0.00               -0.50                -2.32    0.00
## alcohol                  -0.01               -3.81               -17.91    0.00
## quality                  -0.01                0.86                -2.04    0.00
##                         pH sulphates alcohol quality
## fixed.acidity        -0.05      0.06   -0.15   -0.09
## volatile.acidity      0.01      0.01   -0.01   -0.04
## citric.acid          -0.01      0.00    0.00    0.01
## residual.sugar       -0.20     -0.13   -2.04   -0.15
## chlorides             0.00      0.00   -0.01   -0.01
## free.sulfur.dioxide  -0.42     -0.50   -3.81    0.86
## total.sulfur.dioxide -2.17     -2.32  -17.91   -2.04
## density               0.00      0.00    0.00    0.00
## pH                    0.03      0.00    0.02    0.00
## sulphates             0.00      0.02    0.00    0.01
## alcohol               0.02      0.00    1.42    0.46
## quality               0.00      0.01    0.46    0.76
##                      fixed.acidity volatile.acidity citric.acid residual.sugar
## fixed.acidity                 3.03            -0.08        0.23           0.28
## volatile.acidity             -0.08             0.03       -0.02           0.00
## citric.acid                   0.23            -0.02        0.04           0.04
## residual.sugar                0.28             0.00        0.04           1.99
## chlorides                     0.01             0.00        0.00           0.00
## free.sulfur.dioxide          -2.80            -0.02       -0.12           2.76
## total.sulfur.dioxide         -6.48             0.45        0.23           9.42
## density                       0.00             0.00        0.00           0.00
## pH                           -0.18             0.01       -0.02          -0.02
## sulphates                     0.05            -0.01        0.01           0.00
## alcohol                      -0.11            -0.04        0.02           0.06
## quality                       0.17            -0.06        0.04           0.02
##                      chlorides free.sulfur.dioxide total.sulfur.dioxide density
## fixed.acidity             0.01               -2.80                -6.48       0
## volatile.acidity          0.00               -0.02                 0.45       0
## citric.acid               0.00               -0.12                 0.23       0
## residual.sugar            0.00                2.76                 9.42       0
## chlorides                 0.00                0.00                 0.07       0
## free.sulfur.dioxide       0.00              109.41               229.74       0
## total.sulfur.dioxide      0.07              229.74              1082.10       0
## density                   0.00                0.00                 0.00       0
## pH                        0.00                0.11                -0.34       0
## sulphates                 0.00                0.09                 0.24       0
## alcohol                  -0.01               -0.77                -7.21       0
## quality                   0.00               -0.43                -4.92       0
##                         pH sulphates alcohol quality
## fixed.acidity        -0.18      0.05   -0.11    0.17
## volatile.acidity      0.01     -0.01   -0.04   -0.06
## citric.acid          -0.02      0.01    0.02    0.04
## residual.sugar       -0.02      0.00    0.06    0.02
## chlorides             0.00      0.00   -0.01    0.00
## free.sulfur.dioxide   0.11      0.09   -0.77   -0.43
## total.sulfur.dioxide -0.34      0.24   -7.21   -4.92
## density               0.00      0.00    0.00    0.00
## pH                    0.02     -0.01    0.03   -0.01
## sulphates            -0.01      0.03    0.02    0.03
## alcohol               0.03      0.02    1.14    0.41
## quality              -0.01      0.03    0.41    0.65
##                      fixed.acidity volatile.acidity citric.acid residual.sugar
## fixed.acidity                 0.71             0.00        0.03           0.38
## volatile.acidity              0.00             0.01        0.00           0.03
## citric.acid                   0.03             0.00        0.01           0.06
## residual.sugar                0.38             0.03        0.06          25.73
## chlorides                     0.00             0.00        0.00           0.01
## free.sulfur.dioxide          -0.71            -0.17        0.19          25.80
## total.sulfur.dioxide          3.27             0.38        0.62          86.53
## density                       0.00             0.00        0.00           0.01
## pH                           -0.05             0.00        0.00          -0.15
## sulphates                     0.00             0.00        0.00          -0.02
## alcohol                      -0.13             0.01       -0.01          -2.81
## quality                      -0.08            -0.02        0.00          -0.44
##                      chlorides free.sulfur.dioxide total.sulfur.dioxide density
## fixed.acidity             0.00               -0.71                 3.27    0.00
## volatile.acidity          0.00               -0.17                 0.38    0.00
## citric.acid               0.00                0.19                 0.62    0.00
## residual.sugar            0.01               25.80                86.53    0.01
## chlorides                 0.00                0.04                 0.18    0.00
## free.sulfur.dioxide       0.04              289.24               444.87    0.01
## total.sulfur.dioxide      0.18              444.87              1806.09    0.07
## density                   0.00                0.01                 0.07    0.00
## pH                        0.00                0.00                 0.01    0.00
## sulphates                 0.00                0.11                 0.65    0.00
## alcohol                  -0.01               -5.23               -23.48    0.00
## quality                   0.00                0.12                -6.58    0.00
##                         pH sulphates alcohol quality
## fixed.acidity        -0.05      0.00   -0.13   -0.08
## volatile.acidity      0.00      0.00    0.01   -0.02
## citric.acid           0.00      0.00   -0.01    0.00
## residual.sugar       -0.15     -0.02   -2.81   -0.44
## chlorides             0.00      0.00   -0.01    0.00
## free.sulfur.dioxide   0.00      0.11   -5.23    0.12
## total.sulfur.dioxide  0.01      0.65  -23.48   -6.58
## density               0.00      0.00    0.00    0.00
## pH                    0.02      0.00    0.02    0.01
## sulphates             0.00      0.01    0.00    0.01
## alcohol               0.02      0.00    1.51    0.47
## quality               0.01      0.01    0.47    0.78
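Because covariances depend on the units of each variable, it can help to rescale an entry into a correlation, r = cov(x, y) / sqrt(var(x) · var(y)). The sketch below applies this to two entries of the first (all wine) matrix; the function name is hypothetical and the inputs are the rounded values printed above, so the results are approximate.

```python
import math


def correlation_from_covariance(cov_xy, var_x, var_y):
    """Rescale a covariance into a correlation coefficient."""
    return cov_xy / math.sqrt(var_x * var_y)


# free.sulfur.dioxide & total.sulfur.dioxide, all wine.
r_sulfur = correlation_from_covariance(723.26, 315.04, 3194.72)

# residual.sugar & total.sulfur.dioxide, all wine.
r_sugar = correlation_from_covariance(133.24, 22.64, 3194.72)

print(round(r_sulfur, 2), round(r_sugar, 2))
```

The same rescaling explains why every density entry prints as 0.00: the variance of density is tiny, so its covariances round to zero even for pairs that are clearly correlated.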

3.3 Correlation Analysis

Below are the correlation plots for every pair of variables in each subgroup of the data set. Observations worth mentioning:

  • In all the subgroups there is a high correlation between free.sulfur.dioxide and total.sulfur.dioxide, but that’s to be expected given that free sulfur dioxide is part of the total. The same can be said of citric.acid and fixed.acidity.
  • There is also a high inverse correlation between density and alcohol, meaning that the more alcohol a wine has, the less dense it is.
  • In white wine, the sweeter the wine is (residual.sugar), the denser it is. The density is also correlated with the amount of total.sulfur.dioxide in the wine.
  • In red wine, higher fixed.acidity goes with lower values of pH and higher density. Interestingly, a high presence of citric.acid correlates with lower values of volatile.acidity.

Figure 3.1: Correlation Plots by Wine Subgroup

3.4 Outliers (by group)

We’ll use the Minimum Covariance Determinant (MCD) estimators to analyze the effect of outliers in our data. This analysis must be performed separately for each type of wine.

3.4.1 Red Wine Outliers

Comparing the 11 eigenvalues of the covariance matrix of all the red wine observations against those of the MCD matrix, we can see that most of them, including the two largest, are smaller when only the most central observations are used.

Table 3.2: Eigen Values Comparison
Eigenvalue   Red Wine     MCD Red Wine
1            1133.807     624.183
2            57.93541     40.58744
3            3.101302     3.314625
4            1.819415     1.094648
5            1.0463404    0.1904331
6            0.0413967    0.0418075
7            0.0231927    0.0150331
8            0.0113465    0.0100670
9            0.0100780    0.0076606
10           0.0014550    0.0002149
11           6e-07        4e-07

Figure 3.2: Red Wine Data Set and MCD Eigen values comparison

Now we compare the correlations between the variables using only the heaviest weighted observations against those from the full red wine data. Here we see that some of the correlations between variables increase, meaning that the outliers were weakening these relationships. That is the case for fixed.acidity & density, chlorides & density and sulphates & alcohol.

Figure 3.3: Correlation comparison between red wine subgroup

Using the 99th percentile of the Chi Square distribution with 11 degrees of freedom as the threshold to classify an observation as an outlier, we find 344 outliers in the red wine subgroup. This represents 21.51% of our observations.

Figure 3.4: Outliers by Mahalanobis Distance
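To put the threshold in perspective, the sketch below evaluates the upper-tail probability of a chi-square distribution with odd degrees of freedom using a standard closed-form series, and checks that the tabulated cutoff 24.725 leaves about 1% in the tail for 11 degrees of freedom. The cutoff value is taken from a standard chi-square table and is an assumption here, as the report does not print it; the function name is hypothetical.

```python
import math


def chi2_sf_odd_df(x, df):
    """Upper-tail probability P(X > x) for a chi-square with odd df."""
    if df < 1 or df % 2 == 0:
        raise ValueError("this sketch only handles odd degrees of freedom")
    chi = math.sqrt(x)
    normal_pdf = math.exp(-x / 2) / math.sqrt(2 * math.pi)
    series, numerator, denominator = 0.0, chi, 1.0
    for r in range(1, (df - 1) // 2 + 1):
        series += numerator / denominator
        numerator *= x               # chi**(2r-1) becomes chi**(2r+1)
        denominator *= 2 * r + 1     # 1*3*...*(2r-1) becomes 1*3*...*(2r+1)
    return math.erfc(chi / math.sqrt(2)) + 2 * normal_pdf * series


CUTOFF_11_DF = 24.725  # assumed tabulated 99th percentile for 11 df
tail = chi2_sf_odd_df(CUTOFF_11_DF, 11)

n_red, flagged = 1599, 344  # counts reported in the text
expected_if_chi2 = tail * n_red

print(round(tail, 3), round(expected_if_chi2), flagged)
```

If the squared distances really followed this chi-square distribution, only around 16 of the 1599 red wines would be expected above the cutoff, far fewer than the 344 that were flagged.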

Now we visualize the same plot as in the previous chapter, but this time with the outliers colored as red points. We notice that these outliers are the points on the edge of the mass observed in each relationship, while the “good” data seem to be in the middle of the group. We could describe the latter as the observations most similar to each other.

Using the PCP and the Andrews plot we see a more distinct behavior of the outliers. In the PCP, the observations with extremely high values (especially in the chlorides and residual.sugar variables) are the ones identified as outliers. In the Andrews plot, the outliers (blue lines) are the curves at the extremes, be it on the high side or the low side of the group.

3.4.2 White Wine Outliers

Using the same methodology as above, we use the MCD to find better estimates of the parameters of our data. In this case, the heaviest weighted data do not improve our estimate of the covariance: as the comparison of eigenvalues below shows, the largest eigenvalue even increases.

Table 3.3: Eigen Values Comparison
Eigenvalue   White Wine   MCD White Wine
1            1931.513     2060.696
2            168.4529     145.1536
3            21.56099     23.06863
4            1.07442      1.10776
5            0.6867086    0.6676802
6            0.0185319    0.0187713
7            0.0142898    0.0114369
8            0.0114461    0.0083450
9            0.0086420    0.0053658
10           0.0003961    0.0000749
11           3e-07        2e-07

Figure 3.5: White Wine Data Set and MCD Eigen values comparison

Even so, some relationships do become stronger in this subset, as is the case for the correlations between chlorides & alcohol and between residual.sugar & chlorides.

Figure 3.6: Correlation comparison between white wine subgroup

Again, we use the 99th percentile of the Chi Square distribution with 11 degrees of freedom as the threshold to classify an observation as an outlier. With it we find 536 outliers in the white wine subgroup, which represents 10.94% of our observations.

Figure 3.7: White Wine Outliers by Mahalanobis Distance

Now we visualize the behaviour of the outliers for every pair of variables. The most interesting case is the outliers along the chlorides variable: the “good” data form a very concentrated group on the left side of the plot, while the outliers (red points) are dispersed towards the right side. A similar behaviour can be seen along the volatile.acidity variable.

Figure 3.8: Scatter Plot Matrix of white wine variables (with outliers)

Finally, we examine the differences by group using the PCP and the Andrews plot. The PCP confirms the observation from the previous plot that high values of chlorides and volatile.acidity are a clear indicator of outlier observations. In the Andrews plot we notice the same behaviour as for the red wine outliers: the outliers are the curves at the extremes of the group.

Figure 3.9: PCP of White Wine (with outliers)

Figure 3.10: Andrews Plot of White Wine (with outliers)

3.4.3 Outliers Note

In both subgroups we classified as outliers those observations with a squared Mahalanobis distance larger than the 99th percentile of the Chi Square distribution with 11 degrees of freedom. The number of outliers produced by this rule was extremely high in both cases, which might indicate that our data are not multivariate normal and that the squared distances therefore do not follow the assumed Chi Square distribution.

For the sake of this project, we’ll still remove these observations from the analysis.
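The rule above can be illustrated on two variables, where the matrix inverse can be written out by hand. This is only a sketch: the location and scatter are the red wine values for free.sulfur.dioxide and total.sulfur.dioxide from Table 3.1 and the red wine covariance matrix (not the MCD estimates, which are not printed), the two test points are hypothetical, and with 2 variables the 99% Chi Square cutoff is -2·ln(0.01) instead of the 11-variable cutoff used in the report.

```python
import math


def mahalanobis_sq_2d(point, mean, cov):
    """Squared Mahalanobis distance for two variables."""
    (a, b), (c, d) = cov
    det = a * d - b * c
    dx, dy = point[0] - mean[0], point[1] - mean[1]
    # The inverse of [[a, b], [c, d]] is [[d, -b], [-c, a]] / det.
    return (dx * (d * dx - b * dy) + dy * (-c * dx + a * dy)) / det


mean_red = (15.8749, 46.4678)                     # from Table 3.1
cov_red = ((109.41, 229.74), (229.74, 1082.10))   # from the red wine matrix
cutoff_2_df = -2 * math.log(0.01)                 # exact 99% cutoff for 2 df

typical_wine = (16.0, 47.0)   # hypothetical observation near the centre
unusual_wine = (70.0, 80.0)   # hypothetical observation far from the centre

for wine in (typical_wine, unusual_wine):
    print(wine, mahalanobis_sq_2d(wine, mean_red, cov_red) > cutoff_2_df)
```

In the report the same comparison is made with all 11 variables, and every observation above the cutoff is flagged as an outlier.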

4 Principal Component Analysis

Now we’ll perform a principal component analysis (PCA) to reduce the dimensionality of our data.

From this analysis we get 12 principal components, uncorrelated with each other, each explaining a certain percentage of the variability in our data. This is shown in the left-hand plot below. The right-hand plot shows the cumulative variability explained as each new dimension is added. Based on this we decided to use only the first 4 PCs, which explain 78% of the variability. So this analysis reduces our dimensions from 12 to 4.
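The choice of the number of components can be sketched as a cumulative sum over the eigenvalues of the PCA. The eigenvalues below are hypothetical round numbers chosen only so that they add up to 12 (as they would for 12 standardized variables); they are not the eigenvalues of the wine data, and the function name is an assumption.

```python
def components_needed(eigenvalues, target):
    """Smallest number of components whose cumulative share reaches target."""
    total = sum(eigenvalues)
    cumulative = 0.0
    ordered = sorted(eigenvalues, reverse=True)
    for k, value in enumerate(ordered, start=1):
        cumulative += value / total
        if cumulative >= target:
            return k, cumulative
    return len(ordered), cumulative


# Hypothetical eigenvalues for 12 standardized variables (they sum to 12).
toy_eigenvalues = [4.0, 3.0, 2.0, 1.0, 0.5, 0.5, 0.4, 0.3, 0.1, 0.1, 0.05, 0.05]

k, share = components_needed(toy_eigenvalues, 0.8)
print(k, round(share, 2))
```

The cumulative curve in the right-hand plot is the same running total, and the cut-off share is a judgment call rather than a fixed rule.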

4.1 Variable weight by Dimension

In the plot below we visualize the loading of each variable on the four dimensions selected previously. We can draw the following conclusions from it:

  • Dim1: Is highly correlated with total.sulfur.dioxide and free.sulfur.dioxide, and highly inversely correlated with volatile.acidity and chlorides.
  • Dim2: Is correlated with residual.sugar and density and inversely correlated with alcohol.
  • Dim3: Is correlated with pH and inversely correlated with fixed.acidity and citric.acid.
  • Dim4: Is inversely correlated with sulphates and pH.

(This grouping could be interpreted as specific characteristics of the wine, such as taste, texture, etc. But for that, one would need some chemical and winemaking expertise.)

Figure 4.1: Variable Loading for each Principal Component

4.2 PC Separation by Type of Wine

If we plot each PC against the others and color each observation by its type of wine, we can see whether any component actually helps to distinguish between the two types. In the scatter plot below, the clearest separation always appears in the comparisons that involve Dim1. We could interpret this as Dim1 capturing most of the difference between the two types of wine.

Figure 4.2: Scatter Plot of Principal Components by Type of Wine

Finally, we can check the correlation between each component and the original variables in the data set. Interestingly, after the 4th component there doesn’t seem to be any relevant correlation, which reinforces our decision to take into account only the first four PCs. When we compare these correlations against the loadings analyzed previously, we find that all the loadings above the established threshold (at \(\sqrt{\frac{1}{p}}\)) have a high direct or inverse correlation (\(\lvert x_i \rvert \ge 0.5\)) with their principal component.

Figure 4.3: Correlation Plot between PC’s and OG Variables
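The link between loadings and correlations can be sketched numerically. Assuming the PCA was run on standardized variables with unit-length loading vectors (an assumption, since the scaling is not stated in the text), the correlation between a variable and a component equals the loading times the square root of that component's eigenvalue. The function names and the eigenvalue used below are hypothetical.

```python
import math


def loading_threshold(p):
    """Loading of each variable if all p variables contributed equally."""
    return math.sqrt(1.0 / p)


def loading_to_correlation(loading, eigenvalue):
    """Variable-component correlation for a PCA on standardized variables."""
    return loading * math.sqrt(eigenvalue)


threshold = loading_threshold(12)  # 12 variables entered the PCA
# Hypothetical eigenvalue of 3.2 for the component being inspected.
correlation_at_threshold = loading_to_correlation(threshold, 3.2)

print(round(threshold, 3), round(correlation_at_threshold, 3))
```

Under these assumptions a loading exactly at the threshold reaches a correlation of 0.5 only when the eigenvalue is at least 3, so the agreement between the two criteria is most natural for the leading components.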

5 Clustering

Lastly, we’ll use clustering algorithms to find hidden groups inside our data.

5.1 Partitional Clustering

5.1.1 Number of Clusters

First, we determined how many groups there might be. For this we used 3 approaches:

  • Number of clusters that stabilizes the within-cluster sum of squares (WSS).
  • Number of clusters with the highest average silhouette. (A higher silhouette means that an observation is closer to its own cluster than to the nearest other cluster.)
  • Number of clusters that maximizes the Gap statistic.

Figure 5.1: Optimal number of clusters with different methods

From these 3 methods we get different conclusions:

  • The WSS stabilizes after the 3rd cluster.
  • The maximum average silhouette is reached at k = 2, but it is practically equal at k = 3.
  • The Gap statistic points to 7 clusters.

With a vote of 2 out of 3, we decided to move forward with k = 3.
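For reference, the silhouette of an observation is s = (b - a) / max(a, b), where a is its mean distance to the other members of its own cluster and b is the smallest mean distance to the members of any other cluster. The sketch below computes the average silhouette width for one-dimensional toy data; the function name and the data are hypothetical, and it assumes at least two clusters and no duplicated points.

```python
def average_silhouette(points, labels):
    """Average silhouette width for 1-D points and their cluster labels."""
    widths = []
    for i, (p, label) in enumerate(zip(points, labels)):
        distances = {}
        for j, (q, other) in enumerate(zip(points, labels)):
            if i != j:
                distances.setdefault(other, []).append(abs(p - q))
        if label not in distances:
            widths.append(0.0)  # a singleton cluster gets silhouette 0
            continue
        a = sum(distances[label]) / len(distances[label])
        b = min(sum(d) / len(d) for key, d in distances.items() if key != label)
        widths.append((b - a) / max(a, b))
    return sum(widths) / len(widths)


toy_points = [1.0, 2.0, 10.0, 11.0]
good = average_silhouette(toy_points, [0, 0, 1, 1])  # natural grouping
bad = average_silhouette(toy_points, [0, 1, 0, 1])   # mixed-up grouping
print(good > bad)
```

Values close to 1 indicate compact, well separated clusters, while negative values indicate observations that sit closer to another cluster than to their own.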

5.1.2 K-Means, PAM & CLARA

Now we try 3 different clustering algorithms, setting the number of centers to 3.

  • K-Means: Clustering around centroids.
  • PAM: Partitioning Around Medoids. Clustering around medoids. The difference is that a centroid is an artificial, calculated point, while a medoid is an actual point in the data set.
  • CLARA: Clustering for LARge Applications. An implementation of medoid clustering for large data sets. It follows the same concept, but it works on samples of the data to find a good set of medoids.
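The difference between a centroid and a medoid can be shown on a handful of points. This is a minimal sketch with hypothetical two-dimensional data and function names; it illustrates only the two definitions, not the full K-Means or PAM algorithms.

```python
import math


def centroid(points):
    """Coordinate-wise mean: usually not one of the observed points."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def medoid(points):
    """Observed point with the smallest total distance to all the others."""
    def total_distance(candidate):
        return sum(math.dist(candidate, other) for other in points)
    return min(points, key=total_distance)


# Hypothetical cluster with one far-away observation.
cluster = [(0.0, 0.0), (1.0, 0.0), (0.0, 2.0), (10.0, 10.0)]
print(centroid(cluster), medoid(cluster))
```

The far-away point pulls the centroid towards it, while the medoid stays on one of the central observations, which is why medoid-based methods are usually considered less sensitive to outliers.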

We also try a second implementation of PAM. Instead of building it with the quantitative data only (all these methods work just with quantitative values), we calculate the Gower distance for the whole data set (including the categorical variables) and use that distance matrix as input for the PAM algorithm.
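The Gower distance averages one partial distance per variable: the absolute difference divided by the variable's range for numeric variables, and a 0/1 mismatch for categorical ones. The sketch below uses only three variables; the two rows are hypothetical wines, the ranges come from the minimum and maximum values in the summary output, and the function name is an assumption.

```python
def gower_distance(row_a, row_b, numeric_ranges):
    """Average of per-variable partial distances, each between 0 and 1."""
    parts = []
    for name, value_a in row_a.items():
        value_b = row_b[name]
        if name in numeric_ranges:
            parts.append(abs(value_a - value_b) / numeric_ranges[name])
        else:
            parts.append(0.0 if value_a == value_b else 1.0)
    return sum(parts) / len(parts)


# Ranges (max - min) taken from the summary of the data set.
ranges = {"alcohol": 14.90 - 8.00, "pH": 4.01 - 2.72}

# Hypothetical observations, for illustration only.
wine_a = {"type": "red", "alcohol": 9.5, "pH": 3.3}
wine_b = {"type": "white", "alcohol": 11.3, "pH": 3.1}

print(round(gower_distance(wine_a, wine_b, ranges), 3))
```

Because the type of wine contributes a full unit of distance whenever it differs, a clustering built on this matrix has direct access to the red/white label, which the purely numeric methods do not.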

We visualize the results of these algorithms by plotting the observations on the first 2 principal components. These PCs explain 54% of the variability in the data.

Figure 5.2: Clustering Algorithms with K = 3

From the results of the clustering algorithms we conclude that the PAM implementation using the Gower distance gives the best grouping (at least from the perspective of these PCs). That is not a completely fair comparison, though, given that it has more information than the other algorithms. The rest of the methods give seemingly the same result.

We can compare the performance of these algorithms using their average silhouette width. The closer this value is to 1, the better the grouping. By this measure, the PAM with Gower clustering has the lowest score of all. Its “better” performance in the previous plot might be due to the perspective from which the analysis was made, looking only through the first 2 principal components. The rest of the methods have the same average silhouette width of 0.51, meaning all of them did a reasonably good job of clustering.

If we had to choose only one method based on this metric, it would be K-Means with k = 3. It has the same average silhouette width as the other 2 methods, but it misclassified fewer observations than PAM and CLARA. We can see this in the tail on the left side of the silhouettes of each cluster.

5.2 Hierarchical Clustering

Now we’ll use hierarchical clustering algorithms to group our data. The difference with these algorithms is that they do not require the number of groups to be fixed in advance. These methods work by merging smaller groups into larger ones, or by dividing larger groups into smaller ones of similar data. The procedure creates a hierarchy of clusters that is represented with dendrograms.

5.2.1 Agglomerative Methods

These methods take the bottom-up approach, merging small groups (initially each observation) into larger ones. We’ll compare 4 agglomerative clustering algorithms: Single Linkage, Complete Linkage, Average Linkage and Ward Linkage. The difference between these algorithms is the metric each one uses to decide which clusters to merge.

  • Single Linkage: uses the minimum distance between two points in two clusters.
  • Complete Linkage: the maximum distance between two points in two clusters.
  • Average Linkage: the arithmetic mean of the distances between all pairs of points from the two clusters.
  • Ward Linkage: the squared Euclidean distance between the sample mean vectors of the two clusters, weighted by the cluster sizes.
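The four criteria can be compared side by side on two small clusters. This is a minimal sketch with hypothetical points and a hypothetical function name; for Ward it uses the common form n_a·n_b/(n_a + n_b) times the squared distance between the cluster means, which is the increase in within-cluster sum of squares caused by the merge. Library implementations may scale this value differently.

```python
import math
from itertools import product


def linkage_distances(cluster_a, cluster_b):
    """Distance between two clusters under four linkage criteria."""
    pairwise = [math.dist(p, q) for p, q in product(cluster_a, cluster_b)]
    n_a, n_b = len(cluster_a), len(cluster_b)
    mean_a = [sum(coords) / n_a for coords in zip(*cluster_a)]
    mean_b = [sum(coords) / n_b for coords in zip(*cluster_b)]
    return {
        "single": min(pairwise),
        "complete": max(pairwise),
        "average": sum(pairwise) / len(pairwise),
        "ward": n_a * n_b / (n_a + n_b) * math.dist(mean_a, mean_b) ** 2,
    }


# Hypothetical clusters, for illustration only.
left = [(0.0, 0.0), (0.0, 2.0)]
right = [(3.0, 0.0), (5.0, 0.0)]
result = linkage_distances(left, right)
print({name: round(value, 3) for name, value in result.items()})
```

Single linkage only looks at the closest pair of points, which is one reason it tends to chain observations together into one large cluster.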

5.2.1.1 Single Linkage

5.2.1.2 Complete Linkage

5.2.1.3 Average Linkage

5.2.1.4 Ward Linkage

5.2.2 Conclusion

Table 5.1: Linkage methods Average Silhouette width
Method             Average Silhouette Width
Single Linkage     0.151
Complete Linkage   0.380
Average Linkage    0.455
Ward Linkage       0.432

Based on the average silhouette width of all the linkage algorithms, the average linkage method worked best on our data (0.455), closely followed by the Ward linkage method (0.432). Interestingly, single linkage only found one big cluster.

5.2.3 Divisive Methods

The divisive algorithms use a top-down approach to clustering data. This means that they start with one big cluster, and at each step an existing cluster is divided into two clusters. The most popular algorithm in this family is DIvisive ANAlysis Clustering (DIANA).

5.2.3.1 Diana Method

The average silhouette width of this method was 0.38. This means that it did not outperform the average and Ward linkage algorithms.

In the end, the best clustering algorithm was K-Means, with an average silhouette width of 0.51.